Data processing system having multiple processors and a communication means in a data processing system having multiple processors

ABSTRACT

Aspects involve effectively separating communication hardware in a data processing system by introducing a communication device for each processor. By introducing this separation the processors can concentrate on performing their function-specific tasks, while the communication devices provide the communication support for the respective processors. Accordingly, in certain embodiments, a data processing system is provided with a computation, a communication support and a communication network layer.

The invention relates to a data processing system having multiple processors, and a communication means in a data processing system having multiple processors.

A heterogeneous multiprocessor architecture for high-performance, data-dependent media processing, e.g. for high-definition MPEG decoding, is known. Media processing applications can be specified as a set of concurrently executing tasks that exchange information solely by unidirectional streams of data. G. Kahn introduced a formal model of such applications already in 1974, ‘The Semantics of a Simple Language for Parallel Programming’, Proc. of the IFIP Congress 74, August 5-10, Stockholm, Sweden, North-Holland Publ. Co., 1974, pp. 471-475, followed by an operational description by Kahn and MacQueen in 1977, ‘Coroutines and Networks of Parallel Processes’, Information Processing 77, B. Gilchrist (Ed.), North-Holland Publ., 1977, pp. 993-998. This formal model is now commonly referred to as a Kahn Process Network.

An application is known as a set of concurrently executable tasks. Information can only be exchanged between tasks by unidirectional streams of data. Tasks communicate only deterministically by means of read and write processes on predefined data streams. The data streams are buffered on the basis of a FIFO behaviour. Due to the buffering, two tasks communicating through a stream do not have to synchronise on individual read or write processes.

In stream processing, successive operations on a stream of data are performed by different processors. For example, a first stream might consist of pixel values of an image that are processed by a first processor to produce a second stream of blocks of DCT (Discrete Cosine Transformation) coefficients of 8×8 blocks of pixels. A second processor might process the blocks of DCT coefficients to produce a stream of blocks of selected and compressed coefficients for each block of DCT coefficients.

FIG. 1 shows an illustration of the mapping of an application to a processor as known from the prior art. In order to realise data stream processing, a number of processors are provided, each capable of performing a particular operation repeatedly, each time using data from a next data object from a stream of data objects and/or producing a next data object in such a stream. The streams pass from one processor to another, so that the stream produced by a first processor can be processed by a second processor, and so on. One mechanism of passing data from a first to a second processor is by writing the data blocks produced by the first processor into the memory.

The data streams in the network are buffered. Each buffer is realised as a FIFO, with precisely one writer and one or more readers. Due to this buffering, the writer and readers do not need to mutually synchronize individual read and write actions on the channel. Reading from a channel with insufficient data available causes the reading task to stall. The processors can be dedicated hardware function units which are only weakly programmable. All processors run in parallel and execute their own thread of control. Together they execute a Kahn-style application, where each task is mapped to a single processor. The processors allow multi-tasking, i.e., multiple Kahn tasks can be mapped onto a single processor.

It is therefore an object of the invention to improve the operation of a Kahn-style data processing system.

This object is solved by a data processing system according to claim 1 as well as by a data processing method according to claim 24.

The invention is based on the idea of effectively separating communication hardware, e.g. busses and memory, and computation hardware, e.g. processors, in a data processing system by introducing a communication means for each processor. By introducing this separation the processors can concentrate on performing their function-specific tasks, while the communication means provide the communication support for the respective processor.

Therefore, a data processing system is provided with a computation, a communication support and a communication network layer. The computation layer comprises a first and at least a second processor for processing a stream of data objects. The first processor passes a number of data objects from a stream to the second processor, which can then process the data objects. The communication network layer includes a memory and a communication network for linking the first processor and the second processors with said memory. The communication support layer is arranged between the computation layer and the communication network layer and comprises one communication means for each second processor in the computation layer. The communication means of each of the second processors controls the communication between said second processor and the memory via the communication network in the communication network layer.

The introduction of the communication means between one of the second processors and the communication network layer provides a clearly defined system-level abstraction layer, in particular by providing an abstraction of communication and memory implementation aspects. Furthermore, a distributed organisation with local responsibilities is realised, whereby the scalability of the system is improved.

In a further embodiment of the invention said communication means comprises a reading/writing unit for enabling reading/writing of said associated second processor from/into said memory in the communication network layer, a synchronisation unit for synchronising the reading/writing of said associated second processor and/or inter-processor synchronization of memory access, and/or a task scheduling unit for scheduling tasks related to the attached processor, for administrating a set of tasks to be handled by said second processor, and/or administrating inter-task communication channels. Accordingly, by providing separate units the reading/writing, the synchronisation of the reading/writing and the task switching can be independently controlled by the communication means, allowing a greater freedom in implementing different applications.

In still a further embodiment of the invention said communication means is able to handle multiple inbound and outbound streams and/or multiple streams per task. This has the positive effect that a data stream produced by one task processed by a second processor can be forwarded to several other tasks for further processing, and vice versa.

In another embodiment of the invention the communication means is capable of implementing the same functions for controlling the communication between said attached second processor and said memory independent of said attached processor. Therefore, the design of the communication means can be optimised primarily with regard to the specific functions which are to be implemented by said communication means, avoiding a strong influence of the design of the second processor.

In a further embodiment of the invention the communication between said second processors and their associated communication means is a master/slave communication, with said second processor acting as master.

In a further embodiment of the invention said communication means in said communication support layer comprise an adaptable first task-level interface towards said associated second processor and a second system-level interface towards said communication network and said memory, wherein said first and second interfaces are active concurrently or non-concurrently. The provision of an adaptable task-level interface facilitates the re-use of the communication means in the overall system architecture, while allowing the parameterisation and adaptation for specific applications for a specific second processor.

In still a further embodiment of the invention at least one of said second processors is programmable, the first task-level interface of the communication means of said one of said second processors is at least partly programmable, and wherein part of the functionality of the communication means is programmable.

The invention also relates to a method for processing data in a data processing system comprising a first and at least a second processor for processing streams of data objects, said first processor being arranged to pass data objects from a stream of data objects to the second processor; at least one memory for storing and retrieving data objects; and one communication means for each of said second processors, wherein a shared access to said first and said second processors is provided, and wherein the communication means of each of said second processors controls the communication between said second processor and said memory.

The invention further relates to a communication means in a data processing system having a computation layer including a first and at least one second processor for processing a stream of data objects, said first processor being arranged to pass data objects from a stream of data objects to the second processor; a communication network layer including a communication network and a memory; and a communication support layer being arranged between said computation layer and said communication network layer. The communication means is adapted to be implemented operatively between the second processor and the communication network, is associated with a second processor, and controls the communication between said second processor and said memory via said communication network in the communication network layer.

Further embodiments of the invention are described in the dependent claims.

These and other aspects of the invention are described in more detail with reference to the drawings; the figures showing:

FIG. 1 an illustration of the mapping of an application to a processor according to the prior art;

FIG. 2 a schematic block diagram of an architecture of a stream-based processing system;

FIG. 3 an illustration of the synchronising operation and an I/O operation in the system of FIG. 2;

FIG. 4 an illustration of a cyclic FIFO memory;

FIG. 5 a mechanism of updating local space values in each shell according to FIG. 2;

FIG. 6 an illustration of the FIFO buffer with a single writer and multiple readers;

FIG. 7 a finite memory buffer implementation for a three-station stream; and

FIG. 8 an illustration of reading and administrating the validity of data in a cache.

FIG. 2 shows a processing system for processing streams of data objects according to a preferred embodiment of the invention. The system can be divided into different layers, namely a computation layer 1, a communication support layer 2 and a communication network layer 3. The computation layer 1 includes a CPU 11 and two processors 12a, 12b. This is merely by way of example; obviously more processors may be included in the system. The communication support layer 2 comprises a shell 21 associated to the CPU 11 and shells 22a, 22b associated to the processors 12a, 12b, respectively. The communication network layer 3 comprises a communication network 31 and a memory 32.

The processors 12a, 12b are preferably dedicated processors, each being specialised to perform a limited range of stream processing. Each processor is arranged to apply the same processing operation repeatedly to successive data objects of a stream. The processors 12a, 12b may each perform a different task or function, e.g. variable-length decoding, run-length decoding, motion compensation, image scaling or performing a DCT transformation. In operation each processor 12a, 12b executes operations on one or more data streams. The operations may involve e.g. receiving a stream and generating another stream, receiving a stream without generating a new stream, generating a stream without receiving a stream, or modifying a received stream. The processors 12a, 12b are able to process data streams generated by other processors 12b, 12a, by the CPU 11, or even streams that they have generated themselves. A stream comprises a succession of data objects which are transferred from and to the processors 12a, 12b via said memory 32.

The shells 22a, 22b comprise a first interface towards the communication network layer. This interface is uniform or generic for all the shells. Furthermore the shells 22a, 22b comprise a second interface towards the processor 12a, 12b to which the respective shell 22a, 22b is associated. The second interface is a task-level interface and is customised towards the associated processor 12a, 12b in order to be able to handle the specific needs of said processor 12a, 12b. Accordingly, the shells 22a, 22b have a processor-specific interface as the second interface, but the overall architecture of the shells is generic and uniform for all processors in order to facilitate the re-use of the shells in the overall system architecture, while allowing the parameterisation and adaptation for specific applications.

The shells 22a, 22b comprise a reading/writing unit for data transport, a synchronisation unit and a task switching unit. These three units communicate with the associated processor on a master/slave basis, wherein the processor acts as master. Accordingly, the respective three units are initialised by a request from the processor. Preferably, the communication between the processor and the three units is implemented by a request-acknowledge handshake mechanism in order to hand over argument values and wait for the requested values to return. Therefore the communication is blocking, i.e. the respective thread of control waits for the completion of the call.

The reading/writing unit preferably implements two different operations, namely the read operation enabling the processors 12a, 12b to read data objects from the memory and the write operation enabling the processors 12a, 12b to write data objects into the memory 32. Each task has a predefined set of ports which correspond to the attachment points for the data streams. The arguments for these operations are an ID of the respective port ‘port_id’, an offset ‘offset’ at which the reading/writing should take place, and the variable length of the data objects ‘n_bytes’. The port is selected by the ‘port_id’ argument. This argument is a small non-negative number having a local scope for the current task only.

The synchronisation unit implements two operations for synchronisation to handle local blocking conditions on reading from an empty FIFO or writing to a full FIFO. The first operation, i.e. the getspace operation, is a request for space in the memory implemented as a FIFO, and the second operation, i.e. the putspace operation, is a request to release space in the FIFO. The arguments of these operations are the ‘port_id’ and the ‘n_bytes’ variable length.
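
For illustration, the data transport and synchronisation operations described above could be rendered as the following C interface. This is a minimal sketch; the function names and types are assumptions derived from the argument lists in the text, not definitions taken from the system itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Small non-negative number with a scope local to the current task. */
typedef unsigned port_id_t;

/* Data transport: random access inside the currently granted window. */
void shell_read(port_id_t port_id, size_t offset, size_t n_bytes, void *dst);
void shell_write(port_id_t port_id, size_t offset, size_t n_bytes,
                 const void *src);

/* Synchronisation: request space ahead of the point of access (may be
 * denied, returning false), or release space and advance the point of
 * access. */
bool shell_getspace(port_id_t port_id, size_t n_bytes);
void shell_putspace(port_id_t port_id, size_t n_bytes);
```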

The getspace and putspace operations are performed in a linear-tape or FIFO order of synchronisation, while inside the window acquired by said operations random-access read/write actions are supported.

The task switching unit implements the task switching of the processor as a gettask operation. The arguments for this operation are ‘blocked’, ‘error’, and ‘task_info’.

The argument ‘blocked’ is a Boolean value which is set true if the last processing step could not be successfully completed because a getspace call on an input port or an output port has returned false. Accordingly, the task scheduling unit is quickly informed that this task had better not be rescheduled unless a new ‘space’ message arrives for the blocked port. This argument value is considered to be an advice only, leading to improved scheduling, but it will never affect the functionality. The argument ‘error’ is a Boolean value which is set true if during the last processing step a fatal error occurred inside the coprocessor. Examples from MPEG decoding are for instance the appearance of unknown variable-length codes or illegal motion vectors. If so, the shell clears the task table enable flag to prevent further scheduling and an interrupt is sent to the main CPU to repair the system state. The current task will definitely not be scheduled until the CPU intervenes through software.
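
A minimal sketch of how a multi-tasking processor might drive the gettask operation is given below; the contents of task_info_t and the function names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    int task_id;                 /* which task to run next */
    /* ... further task-specific information ('task_info') ... */
} task_info_t;

/* Provided by the shell's task scheduling unit (hypothetical name). */
task_info_t shell_gettask(bool blocked, bool error);

/* Outline of the processor's control loop around gettask. */
void processor_control_loop(void) {
    bool blocked = false, error = false;
    for (;;) {
        task_info_t t = shell_gettask(blocked, error);
        blocked = false;
        error   = false;
        /* Run one processing step of task t.task_id here. A failed
         * getspace on one of its ports would set blocked = true; a
         * fatal error (e.g. an illegal motion vector) would set
         * error = true, so the next gettask call reports it. */
        (void)t;
    }
}
```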

The operations just described above are initiated by read calls, write calls, getspace calls, putspace calls or gettask calls from the processor.

FIG. 3 depicts an illustration of the process of reading and writing and its associated synchronisation operations. From the processor point of view, a data stream looks like an infinite tape of data having a current point of access. The getspace call issued from the processor asks permission for access to a certain data space ahead of the current point of access, as depicted by the small arrow in FIG. 3a. If this permission is granted, the processor can perform read and write actions inside the requested space, i.e. the framed window in FIG. 3b, using variable-length data as indicated by the n_bytes argument, and at random access positions as indicated by the offset argument.

If the permission is not granted, the call returns false. After one or more getspace calls, and optionally several read/write actions, the processor can decide whether it is finished with processing some part of the data space and issue a putspace call. This call advances the point of access a certain number of bytes, i.e. n_bytes2 in FIG. 3d, ahead, wherein the size is constrained by the previously granted space.
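
Putting these calls together, one processing step under the tape model might look as follows, reusing the hypothetical interface sketched above; note that the putspace may advance the point of access by fewer bytes than were granted.

```c
#include <stdbool.h>
#include <stddef.h>

extern bool shell_getspace(unsigned port_id, size_t n_bytes);
extern void shell_read(unsigned port_id, size_t offset, size_t n_bytes,
                       void *dst);
extern void shell_putspace(unsigned port_id, size_t n_bytes);

void process_step(unsigned in_port) {
    unsigned char buf[64];
    /* Ask for a window ahead of the current point of access. */
    if (!shell_getspace(in_port, sizeof buf))
        return;                            /* denied: caller may retry */
    /* Random access anywhere inside the granted window... */
    shell_read(in_port, 0, sizeof buf, buf);
    /* ...then advance the point of access, possibly by fewer bytes
     * than were granted (n_bytes2 <= granted size). */
    shell_putspace(in_port, sizeof buf / 2);
}
```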

FIG. 4 depicts an illustration of the cyclic FIFO memory. Communicating a stream of data requires a FIFO buffer, which preferably has a finite and constant size. Preferably, it is pre-allocated in memory, and a cyclic addressing mechanism is applied for proper FIFO behaviour in the linear memory address range.
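
The cyclic addressing in the pre-allocated linear range can be sketched as a simple modulo computation; this is an illustrative helper, not the actual address logic of the shells.

```c
#include <stdint.h>

/* Map a point of access plus offset into the pre-allocated linear
 * address range [base, base + size), wrapping for FIFO behaviour. */
static uint32_t fifo_address(uint32_t base, uint32_t size,
                             uint32_t point_of_access, uint32_t offset) {
    return base + (point_of_access + offset) % size;
}
```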

A rotation arrow 50 in the centre of FIG. 4 depicts the direction in which getspace calls from the processor confirm the granted window for read/write, which is the same direction in which putspace calls move the access points ahead. The small arrows 51, 52 denote the current access points of tasks A and B. In this example A is a writer and hence leaves proper data behind, whereas B is a reader and leaves empty space (or meaningless rubbish) behind. The shaded regions (A1, B1) ahead of each access point denote the access window acquired through the getspace operation.

Tasks A and B may proceed at different speeds, and/or may not be serviced for some periods in time due to multitasking. The shells 22a, 22b provide the processors 12a, 12b on which A and B run with information to ensure that the access points of A and B maintain their respective ordering, or more strictly, that the granted access windows never overlap. It is the responsibility of the processors 12a, 12b to use the information provided by the shells 22a, 22b such that overall functional correctness is achieved. For example, the shells 22a, 22b may sometimes answer a getspace request from the processor with false, e.g. due to insufficient available space in the buffer. The processor should then refrain from accessing the buffer according to the denied request for access.

The shells 22a, 22b are distributed, such that each can be implemented close to the processor 12a, 12b that it is associated to. Each shell locally contains the configuration data for the streams which are incident with tasks mapped on its processor, and locally implements all the control logic to properly handle this data. Accordingly, a local stream table is implemented in the shells 22a, 22b that contains a row of fields for each stream, or in other words, for each access point.

To handle the arrangement of FIG. 4, the stream tables of the processor shells 22a, 22b of tasks A and B each contain one such line, holding a ‘space’ field containing a (maybe pessimistic) distance from its own point of access towards the other point of access in this buffer, and an ID denoting the remote shell with the task and port of the other point of access in this buffer. Additionally, said local stream table may contain a memory address corresponding to the current point of access and a coding for the buffer base address and the buffer size in order to support said address increments.
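
As a sketch, one row of such a local stream table could be laid out as follows; the field names and widths are assumptions derived from the description.

```c
#include <stdint.h>

typedef struct {
    uint32_t space;      /* (maybe pessimistic) distance from the own
                            point of access to the other point of access */
    uint16_t remote_id;  /* remote shell, task and port of the other
                            point of access in this buffer */
    uint32_t address;    /* memory address of the current point of access */
    uint32_t base;       /* buffer base address */
    uint32_t size;       /* buffer size, for cyclic address increments */
} stream_table_row;
```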

These stream tables are preferably memory-mapped in small memories, like register files, in each of said shells 22. Therefore, a getspace call can be immediately and locally answered by comparing the requested size with the locally stored available space. Upon a putspace call this local space field is decremented by the indicated amount, and a putspace message is sent to the other shell which holds the previous point of access to increment its space value. Correspondingly, upon reception of such a putspace message from a remote source the shell 22 increments the local field. Since the transmission of messages between shells takes time, cases may occur where the two space fields do not sum up to the entire buffer size but momentarily contain pessimistic values. However, this does not violate synchronisation safety. It might even happen in exceptional circumstances that multiple messages are currently on their way to their destination and that they are serviced out of order, but even in that case the synchronisation remains correct.

FIG. 5 shows a mechanism of updating local space values in each shell and sending ‘putspace’ messages. In this arrangement, a getspace request, i.e. the getspace call, from the processor 12a, 12b can be answered immediately and locally in the associated shell 22a, 22b by comparing the requested size with the locally stored space information. Upon a putspace call, the local shell 22a, 22b decrements its space field by the indicated amount and sends a putspace message to the remote shell. The remote shell, i.e. the shell of another processor, holds the other point of access and increments the space value there. Correspondingly, the local shell increments its space field upon reception of such a putspace message from a remote source.

The space field belonging to a point of access is modified by two sources: it is decremented upon local putspace calls and incremented upon received putspace messages. If such an increment or decrement is not implemented as an atomic operation, this could lead to erroneous results. In such a case separate local-space and remote-space fields might be used, each of which is updated by a single source only. Upon a local getspace call these values are then subtracted. Each shell 22 is always in control of updates of its own local table and performs these in an atomic way. Clearly this is a shell implementation issue only, which is not visible in its external functionality.
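
A minimal sketch of the separated local-space and remote-space bookkeeping, assuming C11 atomics and hypothetical names; each counter has a single writer, and a getspace combines the two on the fly.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t initial_space;           /* space available at stream set-up */
    uint32_t local_consumed;          /* advanced only by local putspace calls */
    _Atomic uint32_t remote_released; /* advanced only by remote putspace msgs */
} space_fields;

/* getspace: subtract the two single-writer counters (wrap-safe in
 * unsigned arithmetic) and compare with the requested size. */
static bool getspace(space_fields *s, uint32_t n_bytes) {
    uint32_t avail = s->initial_space
                   + atomic_load(&s->remote_released)
                   - s->local_consumed;
    return n_bytes <= avail;
}

static void putspace_local(space_fields *s, uint32_t n_bytes) {
    s->local_consumed += n_bytes;   /* single source: the local shell */
    /* ...followed by sending a putspace message to the remote shell... */
}

static void putspace_message_received(space_fields *s, uint32_t n_bytes) {
    atomic_fetch_add(&s->remote_released, n_bytes); /* single remote source */
}
```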

If a getspace call returns false, the processor is free to decide how to react. Possibilities are: a) the processor may issue a new getspace call with a smaller n_bytes argument, b) the processor might wait for a moment and then try again, or c) the processor might quit the current task and allow another task on this processor to proceed.
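
The three possible reactions could be dispatched as in the following sketch; the policy input and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

extern bool shell_getspace(unsigned port_id, size_t n_bytes);

/* Sketch of the three possible reactions to a denied getspace. */
void on_getspace_denied(unsigned port_id, size_t n_bytes) {
    bool data_expected_soon = false;        /* placeholder policy input */
    if (shell_getspace(port_id, n_bytes / 2)) {
        /* (a) proceed with a smaller granted window */
    } else if (data_expected_soon) {
        /* (b) wait a moment, then issue the same request again */
    } else {
        /* (c) quit the current task, reporting blocked=true via
         *     gettask, and let another task on this processor proceed */
    }
}
```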

This allows the decision for task switching to depend upon the expected arrival time of more data and the amount of internally accumulated state with associated state-saving cost. For non-programmable dedicated hardware processors, this decision is part of the architectural design process.

The implementation and operation of the shells 22 do not differentiate between read and write ports, although particular instantiations may make these differentiations. The operations implemented by the shells 22 effectively hide implementation aspects such as the size of the FIFO buffer, its location in memory, any wrap-around mechanism on addresses for memory-bound cyclic FIFOs, caching strategies, cache coherency, global I/O alignment restrictions, data bus width, memory alignment restrictions, communication network structure and memory organisation.

Preferably, the shells 22a, 22b operate on unformatted sequences of bytes. There is no need for any correlation between the synchronisation packet sizes used by the writer and a reader which communicate the stream of data. A semantic interpretation of the data contents is left to the processor. The task is not aware of the application graph incidence structure, such as which other tasks it is communicating with, on which processors these tasks are mapped, or which other tasks are mapped on the same processor.

In high-performance implementations of the shells 22 the read calls, write calls, getspace calls and putspace calls can be issued in parallel via the read/write unit and the synchronisation unit of the shells 22a, 22b. Calls acting on different ports of the shells 22 do not have any mutual ordering constraint, while calls acting on identical ports of the shells 22 must be ordered according to the caller task or processor. For such cases, the next call from the processor can be launched when the previous call has returned, in a software implementation by returning from the function call and in a hardware implementation by providing an acknowledgement signal.

A zero value of the size argument, i.e. n_bytes, in the read call can be reserved for performing pre-fetching of data from the memory into the shell's cache at the location indicated by the port_id and offset arguments. Such an operation can be used for automatic pre-fetching performed by the shell. Likewise, a zero value in the write call can be reserved for a cache flush request, although automatic cache flushing is a shell responsibility.

Optionally, all five operations accept an additional last task_id argument. This is normally the small positive number obtained as a result value from an earlier gettask call. The zero value for this argument is reserved for calls which are not task-specific but relate to processor control.

In the preferred embodiment, the set-up for communicating a data stream is a stream with one writer and one reader connected to a finite-size FIFO buffer. Such a stream requires a FIFO buffer which has a finite and constant size. It will be pre-allocated in memory, and in its linear address range a cyclic addressing mechanism is applied for proper FIFO behaviour.

However, in a further embodiment based on FIG. 2 and FIG. 6, the data stream produced by one task is to be consumed by two or more different consumers having different input ports. Such a situation can be described by the term forking. However, it is desirable to re-use the task implementations both for multi-tasking hardware processors and for software tasks running on the CPU. This is implemented through tasks having a fixed number of ports corresponding to their basic functionality, while any needs for forking induced by the application configuration are to be resolved by the shell.

Clearly, stream forking could be implemented by the shells 22 by just maintaining two separate normal stream buffers, by doubling all write and putspace operations and by performing an AND-operation on the result values of the doubled getspace checks. Preferably, this is not implemented, as the costs would include a doubled write bandwidth and probably more buffer space. Instead, the implementation is preferably made with two or more readers and one writer sharing the same FIFO buffer.

FIG. 6 shows an illustration of the FIFO buffer with a single writer and multiple readers. The synchronisation mechanism must ensure a normal pairwise ordering between A and B next to a pairwise ordering between A and C, while B and C have no mutual constraints, e.g. assuming they are pure readers. This is accomplished in the shell associated to the processor performing the writing operation by keeping track of the available space separately for each reader (A to B and A to C). When the writer performs a local getspace call, its n_bytes argument is compared with each of these space values. This is implemented by using extra lines in said stream table for forking, connected by one extra field or column to indicate changing to a next line.
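
A sketch of the forked getspace check, assuming the writer's shell keeps one space value per reader; the names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* The writer's shell grants a window only if every reader has enough
 * space, so the slowest reader bounds the grant. */
static bool getspace_forked(const uint32_t space_per_reader[],
                            unsigned n_readers, uint32_t n_bytes) {
    for (unsigned i = 0; i < n_readers; i++)
        if (n_bytes > space_per_reader[i])
            return false;
    return true;
}
```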

This adds very little overhead for the majority of cases where forking is not used, and at the same time does not limit forking to two-way only. Preferably, forking is implemented only by the writer, and the readers are not aware of this situation.

In a further embodiment based on FIG. 2 and FIG. 7, the data stream is realised as a three-station stream according to the tape model. Each station performs some updates of the data stream which passes by. An example of the application of the three-station stream is one writer, an intermediate watchdog and a final reader. In such an example the second task preferably watches the data that passes and may inspect some of it, while mostly allowing the data to pass without modification. Relatively infrequently it could decide to change a few items in the stream. This can be achieved efficiently by in-place buffer updates by a processor, avoiding copying the entire stream contents from one buffer to another. In practice this might be useful when hardware processors 12 communicate and the main CPU 11 intervenes to modify the stream, to correct hardware flaws, to do adaptation towards slightly different stream formats, or just for debugging reasons. Such a set-up can be achieved with all three processors sharing the single stream buffer in memory, to reduce memory traffic and processor workload. The task B will not actually read or write the full data stream.

FIG. 7 depicts a finite memory buffer implementation for a three-station stream. The proper semantics of this three-way buffer include maintaining a strict ordering of A, B and C with respect to each other and ensuring no overlapping windows. In this way the three-way buffer is an extension of the two-way buffer shown in FIG. 4. Such a multi-way cyclic FIFO is directly supported by the operations of the shells as described above, as well as by the distributed implementation style with putspace messages as discussed in the preferred embodiment. There is no limitation to just three stations in a single FIFO. In-place processing, where one station both consumes and produces useful data, is also applicable with only two stations. In this case both tasks perform in-place processing to exchange data with each other, and no empty space is left in the buffer.

In a further embodiment based on FIG. 2, single access to a buffer is described. Such a single-access buffer comprises only a single port. In this example no data exchange between tasks or processors is performed. Instead, it is merely an application of the standard communication operations of said shells for local use. The set-up consists of the standard buffer memory having a single access point attached to it. The task can now use the buffer as a local scratchpad or cache. From an architectural point of view this can have advantages, such as the combined use of a larger memory for several purposes and tasks, and for example the use of a software-configurable memory size. Besides the use as scratchpad memory to serve the task-specific algorithm, this set-up is well applicable for storing and retrieving task states in a multi-tasking processor. In this case performing read/write operations for state swapping is not part of the task's functional code itself but part of the processor control code. As the buffer is not used to communicate with other tasks, there is normally no need to perform the putspace and getspace operations on this buffer.

In a further embodiment based on FIG. 2 and FIG. 8, the shells 22 according to the preferred embodiment further comprise a data cache for data transport, i.e. read operations and write operations, between the processors 12 and the communication network 31 and the memory 32. The implementation of a data cache in the shells 22 provides a transparent translation of data bus widths, a resolution of alignment restrictions on the global interconnect, i.e. the communication network 31, and a reduction of the number of I/O operations on the global interconnect.

Preferably, the shells 22 comprise the cache in the read and write interfaces; however, these caches are invisible from the application functionality point of view. Here, the putspace and getspace operations are used to explicitly control cache coherence. The caches play an important role in decoupling the processor read and write ports from the global interconnect of the communication network 3. These caches have a major influence on the system performance regarding speed, power and area.

The access window on stream data which is granted to a task port is guaranteed to be private. As a result, read and write operations in this area are safe and at first sight do not need intermediate intra-processor communication. The access window is extended by means of a local getspace request, obtaining new memory space from a predecessor in the cyclic FIFO. If some part of the cache is tagged to correspond to such an extension and the task may be interested in reading the data in that extension, then such part of the cache needs invalidation. If then later a read operation occurs on this location, a cache miss occurs and fresh valid data is loaded into the cache. An elaborate shell implementation could use the getspace to issue a pre-fetch request to reduce the cache miss penalty. The access window is shrunk by means of a local putspace request, leaving new memory space to a successor in the cyclic FIFO. If some part of such a shrink happens to be in the cache and that part has been written, i.e. is dirty, then such part of the cache needs to be flushed to make the local data available to other processors. Sending the putspace message out to another processor must be postponed until the cache flush is completed and safe ordering of memory operations can be guaranteed.
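
The required ordering, flush first and putspace message second, can be sketched as follows; the helper functions and the port_state layout are assumptions, and wrap-around of the flushed range is ignored for brevity.

```c
#include <stdint.h>

/* Hypothetical helpers for this sketch. */
extern void cache_flush_range(uint32_t mem_addr, uint32_t n_bytes);
extern void send_putspace_message(uint16_t remote_id, uint32_t n_bytes);

typedef struct {
    uint32_t base, size, point_of_access;
    uint16_t remote_id;
} port_state;

/* Dirty cache contents overlapping the released range are flushed
 * before the putspace message goes out, so the successor never reads
 * stale memory. */
void putspace_with_flush(port_state *p, uint32_t n_bytes) {
    cache_flush_range(p->base + p->point_of_access, n_bytes);
    p->point_of_access = (p->point_of_access + n_bytes) % p->size;
    send_putspace_message(p->remote_id, n_bytes); /* only after the flush */
}
```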

Using only local getspace and putspace events for explicit cache coherency control is relatively easy to implement in large system architectures in comparison with other generic cache coherency mechanisms such as bus snooping. Also, it does not incur the communication overhead of, for instance, a cache write-through architecture.

The getspace and putspace operations are defined to operate at byte granularity. A major responsibility of the cache is to hide the global interconnect data transfer size and the data transfer alignment restrictions from the processor. Preferably, the data transfer size is set to 16 bytes, aligned on 16-byte boundaries, whereas synchronised data quantities as small as 2 bytes may be actively used. Therefore, the same memory word or transferred unit can be stored simultaneously in the caches of different processors, and invalidation information is handled in each cache at byte granularity.

FIG. 8 shows the reading and administrating of the validity of data in a cache in three different situations. In this figure each of the situations assumes that the read request occurs on an empty cache, resulting in a cache miss. FIG. 8a indicates a read request that leads to fetching a memory transfer unit 800, i.e. a word, which is entirely contained inside the granted window 810. Clearly this whole word is valid in memory and no specific (in)validation measures are required.

In FIG. 8b the fetched word 801 partially extends beyond the space 811 acquired by the processor but remains inside the space that is locally administrated in the shell as available. If only the getspace argument were used, this word would become partially declared invalid and it would need to be re-read once the getspace window is extended. However, if the actual value of the available space is checked, the entire word can be marked as valid.

In FIG. 8c the fetched word 802 partially extends into space 820 which is not known to be safe and might still be written by some other processor. Now it is mandatory to mark this area in the word as invalid when it is loaded into the cache. If this part of the word gets accessed later, the word needs to be re-read, since the unknown part could in general also extend into this word to the left of the current point of access.
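
A sketch of marking byte validity for the case of FIG. 8c, assuming the 16-byte transfer unit mentioned above; the names are illustrative.

```c
#include <stdint.h>

#define WORD_BYTES 16u   /* transfer unit size, as mentioned above */

/* When a fetched word extends beyond the space known to be safe
 * (FIG. 8c), only the bytes below 'safe_end' are marked valid; the
 * remaining bytes must cause a re-read when accessed later. */
static uint16_t valid_byte_mask(uint32_t word_start, uint32_t safe_end) {
    uint16_t mask = 0;
    for (uint32_t b = 0; b < WORD_BYTES; b++)
        if (word_start + b < safe_end)
            mask |= (uint16_t)(1u << b);
    return mask;
}
```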

Furthermore, a single read request could cover more than one memory word, either because it crosses the boundary between two successive words or because the read interface of the processor is wider than the memory word. FIG. 8 shows memory words which are relatively large in comparison with the requested buffer space. In practice the requested windows would often be much larger; however, in an extreme case the entire cyclic communication buffer could also be as small as a single memory word.

In cache coherency control there are tight relations between getspace, read operations and (in)valid marks, as well as between putspace, write operations, dirty marks and cache flushes. In a ‘Kahn’-style application, ports have a dedicated direction, either input or output. Preferably, separate read and write caches are used, which simplifies some implementation issues. As for many streams the processors will linearly work through the cyclic address space, the read caches optionally support pre-fetching and the write caches optionally support pre-flushing: when the read access moves to the next word, the cache location of the previous word can be made available for expected future use. Separate implementations of the read and write data paths also more easily support read and write requests from the processor occurring in parallel, for instance in a pipelined processor implementation.

Also, the processors write data at byte granularity, and the cache administrates dirty bits per byte. Upon a putspace request the cache flushes those words from the cache to the shared memory which overlap with the address range indicated by this request. The dirty bits are used as the write mask in the bus write requests to assure that the memory is never written at byte positions outside the access window.
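
A sketch of such a masked flush, with the per-byte dirty bits doubling as the bus write mask; the bus primitive is hypothetical.

```c
#include <stdint.h>

#define WORD_BYTES 16u

typedef struct {
    uint8_t  data[WORD_BYTES];
    uint16_t dirty;  /* one dirty bit per byte, set on processor writes */
} cache_word;

/* Hypothetical bus primitive: writes only the bytes enabled in 'mask'. */
extern void bus_masked_write(uint32_t mem_addr, const uint8_t *data,
                             uint16_t mask);

/* On a putspace-triggered flush, the dirty bits form the byte write
 * mask, so memory is never written outside the access window. */
static void flush_word(cache_word *w, uint32_t mem_addr) {
    if (w->dirty != 0) {
        bus_masked_write(mem_addr, w->data, w->dirty);
        w->dirty = 0;
    }
}
```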

In another embodiment based on FIG. 2, the synchronisation units in the shell 22a are connected to other synchronisation units in another shell 22b. The synchronisation units ensure that one processor does not access memory locations before valid data for a processed stream has been written to these memory locations. Similarly, the synchronisation interface is used to ensure that the processor 12a does not overwrite useful data in memory 32. The synchronisation units communicate via a synchronisation message network. Preferably, they form part of a ring, in which synchronisation signals are passed from one processor to the next, or blocked and overwritten when these signals are not needed at any subsequent processor. The synchronisation units together form a synchronisation channel. The synchronisation units maintain information about the memory space which is used for transferring the stream of data objects from processor 12a to processor 12b.

1. A data processing system that processes data in layers of a layered communication protocol where data is passed by progressing through the layers sequentially, the system comprising: a computation layer including a first general-purpose processor and at least one second processor for processing a stream of data objects according to a task-specific configuration implemented in the second processor, said first processor being arranged to pass data objects from a stream of data objects to the second processor via the layered communication protocol; a communication network layer including a communication network and a memory; and a communication support layer including one communication unit for each of said second processors, said communication support layer being arranged between said computation layer and said communication network layer; wherein the communication unit of each of said second processors controls the communication between said second processor and said memory via said communication network in the communication network layer.
2. Data processing system according to claim 1, wherein said second processor is a multi-tasking processor capable of interleaved processing of a first and a second task, said first and second tasks processing a first and a second stream of data objects, respectively.
3. Data processing system according to claim 2, wherein said communication units are arranged to handle multiple inbound and outbound streams and multiple streams per task.
4. Data processing system according to claim 1, wherein each of said communication units comprises: a reading/writing unit for enabling reading/writing of said associated second processor from/into said memory in the communication network layer, a synchronization unit for synchronizing the reading/writing of said associated second processor and/or inter-processor synchronization of memory access, and/or a task scheduling unit for scheduling tasks related to the attached processor, for administrating a set of tasks to be handled by said second processor, and/or administrating inter-task communication channels.
5. Data processing system according to claim 1, wherein the communication unit is adapted to control the communication between said second processor and said memory independent of said processor.
6. Data processing system according to claim 1, wherein said communication unit provides functionality for mapping transported data into memory ranges.
7. Data processing system according to claim 1, wherein the communication between said second processors and their associated communication units is a master/slave communication, said second processors acting as masters.
8. Data processing system according to claim 1, wherein said second processors are function-specific dedicated processors for performing a range of stream processing tasks.
9. Data processing system according to claim 1, wherein said communication units in said communication support layer comprise an adaptable first task-level interface towards said associated second processor in said computation layer and a second system-level interface towards said communication network and said memory, wherein said first and second interfaces are active concurrently or non-concurrently.
10. Data processing system according to claim 1, wherein at least one of said second processors is programmable, the first task-level interface of the communication unit of said one of said second processors is at least partly programmable, and wherein part of the functionality of the communication unit is programmable.
11. Data processing system according to claim 1, wherein said communication units comprise additional interfaces for exchanging control information and/or synchronization information directly with other communication units in said data processing system.
12. Data processing system according to claim 11, wherein said communication units are connected via their additional interfaces in a token ring arrangement.
13. Data processing system according to claim 1, wherein each said communication unit is adapted to handle multi-casting of an output stream to more than one receiving second processor without notifying the sending second processor thereof.
14. Data processing system according to claim 1, wherein each said communication unit is adapted to hide implementation aspects of the communication network from the associated second processor.
15. A communication unit in a data processing system that processes data in layers of a layered communication protocol where data is passed by progressing through the layers sequentially, comprising: a computation layer including a first general-purpose processor and at least one second processor for processing a stream of data objects according to a task-specific configuration implemented in the second processor, said first general-purpose processor being arranged to pass data objects from a stream of data objects to the second processor via the layered communication protocol; a communication network layer including a communication network and a memory; and a communication support layer being arranged between said computation layer and said communication network layer; wherein the communication unit is associated to the second processors and controls the communication between said second processor and said memory via said communication network in the communication network layer.
16. A communication unit according to claim 15, wherein said communication unit is arranged to handle multiple inbound and outbound streams and/or multiple streams per task.
17. A communication unit according to claim 15, further comprising a reading/writing unit for enabling reading/writing of said associated second processor from/into said memory in the communication network layer, a synchronization unit for synchronizing the reading/writing of said associated second processor and/or inter-processor synchronization of memory access, and/or a task scheduling unit for scheduling tasks related to the attached processor, for administrating a set of tasks to be handled by said second processor, and/or administrating inter-task communication channels.
18. A communication unit according to claim 15, wherein the communication between the communication unit and the second processor is a master/slave communication, said second processor acting as master.
19. A communication unit according to claim 15, further comprising an adaptable first task-level interface towards said associated second processor in said computation layer and a second system-level interface towards said communication network and said memory, wherein said first and second interfaces are active concurrently or non-concurrently.
20. A communication unit according to claim 15, wherein the first task-level interface is at least partly programmable, and wherein part of the functionality of the communication unit is programmable.
21. A communication unit according to claim 15, further comprising additional interfaces for exchanging control information and/or synchronization information directly with other communication units in said data processing system.
22. A communication unit according to claim 15, wherein said communication unit is adapted to handle multi-casting of an output stream to more than one receiving second processor without notifying the sending second processor thereof.
23. A communication unit according to claim 15, wherein said communication unit is adapted to hide implementation aspects of the communication network from the associated second processor.
24. Method for processing data in a data processing system that processes data in layers of a layered communication protocol where data is passed by progressing through the layers sequentially, the system having: a computation layer including a first general-purpose processor and at least a second processor for processing a stream of data objects according to a task-specific configuration implemented in the second processor, said first processor being arranged to pass data objects from a stream of data objects to the second processor via the layered communication protocol; a communication network layer including a communication network and at least one memory for storing and retrieving data objects; and a communication support layer including one communication unit for each of said second processors, said communication support layer being arranged between said computation layer and said communication network layer, wherein a shared access to said first general-purpose processor and said second processors is provided, said method comprising the steps of: the communication unit of each of said second processors controlling the communication between said second processor and said memory via said communication network in the communication network layer.
25. Method for processing data according to claim 24, further comprising the step of said communication unit handling multiple inbound and outbound streams and/or multiple streams per task.
26. Method for processing data according to claim 24, wherein said controlling step comprises the steps of: enabling reading/writing of said associated second processor from/into said memory in the communication network layer, synchronizing the reading/writing of said associated second processor and/or inter-processor synchronization of memory access, and/or scheduling tasks related to the attached processor, administrating a set of tasks to be handled by said second processor, and/or administrating inter-task communication channels, wherein said controlling step is carried out on an interconnected circuit arrangement, with the respective layers being processing layers within the circuit arrangement.
27. Method for processing data according to claim 24, wherein the communication between said second processors and their associated communication units is a master/slave communication, said second processors acting as masters.
28. Method for processing data according to claim 24, wherein said communication unit handles multi-casting of an output stream to more than one receiving second processor without notifying the sending second processor thereof.
29. Method for processing data according to claim 24, wherein said communication unit hides implementation aspects of the communication network from the associated second processor.
30. The data processing system of claim 1, wherein the data processing system is implemented on a common circuit arrangement for processing video data and generating an output video stream.
31. The communication unit of claim 15, wherein each layer is implemented on a common circuit arrangement for processing video data and generating an output video stream.